Filter Structures


A realizable filter must require only a finite number of computations per
output sample. For linear, causal, time-invariant filters, this restricts one to
rational transfer functions of the form

$$H(z) = \frac{b_0 + b_1 z^{-1} + \dots + b_M z^{-M}}{1 + a_1 z^{-1} + \dots + a_N z^{-N}}$$

Assuming no pole-zero cancellations, $H(z)$ is FIR if $a_i = 0$ for all $i > 0$,
and IIR otherwise. Filter structures usually implement rational transfer
functions as difference equations.


Whether FIR or IIR, a given transfer function can be implemented with 
many different filter structures. With infinite-precision data, coefficients, 
and arithmetic, all filter structures implementing the same transfer function 
produce the same output. However, different filter structures may produce
very different errors with quantized data and finite-precision or fixed-point 
arithmetic. The computational expense and memory usage may also differ 
greatly. Knowledge of different filter structures allows DSP engineers to 
trade off these factors to create the best implementation. 


FIR Filter Structures 


Consider causal FIR filters: $y(n) = \sum_{k=0}^{M-1} h(k)\, x(n-k)$; this can be realized using
the following structure

[Figure: direct-form FIR filter flow graph]

or in a different notation

[Figure: direct-form FIR filter, alternative notation]

This is called the direct-form FIR filter structure.


There are no closed loops (no feedback) in this structure, so it is called a non- 
recursive structure. Since any FIR filter can be implemented using the direct-form, 
non-recursive structure, it is always possible to implement an FIR filter non- 
recursively. However, it is also possible to implement an FIR filter recursively, and 
for some special sets of FIR filter coefficients this is much more efficient. 
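
As a minimal sketch of the non-recursive direct form described above (the function name `fir_direct` and the zero initial conditions are assumptions made for illustration), each output is a weighted sum over a tapped delay line:

```python
def fir_direct(h, x):
    """Direct-form FIR: y(n) = sum_k h(k) x(n-k), zero initial conditions."""
    delay = [0.0] * len(h)              # delay[k] holds x(n-k)
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]   # shift the new sample into the delay line
        y.append(sum(hk * dk for hk, dk in zip(h, delay)))
    return y


h = [0.5, 0.25, 0.25]
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
print(fir_direct(h, impulse))   # the impulse response reproduces h, then zeros
```

There is no feedback path: every output depends only on the current and past inputs.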


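Example:

$$y(n) = \sum_{k=0}^{M-1} x(n-k)$$

where

$$h(k) = 1, \quad 0 \le k \le M-1$$

But note that
$$y(n) = y(n-1) + x(n) - x(n-M)$$
This can be implemented as

[Figure: recursive implementation of the length-M running sum]

Instead of costing $M - 1$ adds/output point, this comb filter costs only two
adds/output.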
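
The equivalence of the two forms of the running sum can be checked with a short sketch (function names are illustrative, and the input is assumed to be zero before n = 0):

```python
def moving_sum_direct(x, M):
    """Non-recursive form: y(n) = sum_{k=0}^{M-1} x(n-k), with x(n) = 0 for n < 0."""
    return [sum(x[max(0, n - M + 1): n + 1]) for n in range(len(x))]


def moving_sum_recursive(x, M):
    """Recursive form: y(n) = y(n-1) + x(n) - x(n-M); two adds per output."""
    y, prev = [], 0
    for n, sample in enumerate(x):
        oldest = x[n - M] if n >= M else 0
        prev = prev + sample - oldest
        y.append(prev)
    return y


x = [3, 1, 4, 1, 5, 9, 2, 6]
print(moving_sum_direct(x, 3))
print(moving_sum_recursive(x, 3))
```

With integer data the two outputs agree exactly; with finite-precision data the recursive form accumulates error, which is the subject of the exercise that follows.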


Exercise: 
Problem: Is this stable, and if not, how can it be made so? 


IIR filters must be implemented with a recursive structure, since that's the only way a 
finite number of elements can generate an infinite-length impulse response in a linear, 
time-invariant (LTI) system. Recursive structures have the advantages of being able 
to implement IIR systems, and sometimes greater computational efficiency, but the 
disadvantages of possible instability, limit cycles, and other deleterious effects that
we will study shortly. 


Transpose-form FIR filter structures 


The flow-graph-reversal theorem says that if one changes the directions of all the
arrows, applies the input at the output, and takes the output from the input of the reversed
flow-graph, the new system has an input-output relationship identical to that of the original
flow-graph.

[Figure: Direct-form FIR structure]

Cascade structures
The z-transform of an FIR filter can be factored into a cascade of short-length filters

$$b_0 + b_1 z^{-1} + b_2 z^{-2} + \dots + b_M z^{-M} = b_0 \left(1 - z_1 z^{-1}\right)\left(1 - z_2 z^{-1}\right) \cdots \left(1 - z_M z^{-1}\right)$$

where the $z_i$ are the zeros of this polynomial. Since the coefficients of the polynomial
are usually real, the roots usually occur in complex-conjugate pairs, so we generally
combine $\left(1 - z_i z^{-1}\right)\left(1 - \overline{z_i} z^{-1}\right)$ into one quadratic (second-order) section with real
coefficients

$$\left(1 - z_i z^{-1}\right)\left(1 - \overline{z_i} z^{-1}\right) = 1 - 2\Re(z_i) z^{-1} + |z_i|^2 z^{-2} = H_i(z)$$


The overall filter can then be implemented in a cascade structure. 


This is occasionally done in FIR filter implementation when one or more of the short- 
length filters can be implemented efficiently. 


Lattice Structure 


It is also possible to implement FIR filters in a lattice structure; this is sometimes
used in adaptive filtering.

[Figure: FIR lattice structure]


IIR Filter Structures 


IIR (Infinite Impulse Response) filter structures must be recursive (use feedback); an infinite number of 
coefficients could not otherwise be realized with a finite number of computations per sample. 


$$H(z) = \frac{N(z)}{D(z)} = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2} + \dots + b_M z^{-M}}{1 + a_1 z^{-1} + a_2 z^{-2} + \dots + a_N z^{-N}}$$

The corresponding time-domain difference equation is

$$y(n) = -a_1 y(n-1) - a_2 y(n-2) - \dots - a_N y(n-N) + b_0 x(n) + b_1 x(n-1) + \dots + b_M x(n-M)$$


Direct-form I IIR Filter Structure 


The difference equation above is implemented directly as written by the Direct-Form I IIR Filter Structure.

Note that this is a cascade of two systems, $N(z)$ and $\frac{1}{D(z)}$. If we reverse the order of the filters, the overall
system is unchanged. The memory elements then appear in the middle and store identical values, so they can be
combined to form the Direct-Form II IIR Filter Structure.


Direct-Form II IIR Filter Structure 


This structure is canonic: (i.e., it requires the minimum number of memory elements). 
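
The following sketch compares the two direct forms numerically. The function names and the coefficient convention (`a = [a1, ..., aN]` with the leading 1 implied, `b = [b0, ..., bM]`, filter order at least 1) are assumptions made for this illustration.

```python
def iir_df1(b, a, x):
    """Direct-Form I: separate delay lines for past inputs and past outputs."""
    xd = [0.0] * len(b)      # xd[k] = x(n-k)
    yd = [0.0] * len(a)      # yd[k] = y(n-1-k)
    out = []
    for sample in x:
        xd = [sample] + xd[:-1]
        y = (sum(bk * xk for bk, xk in zip(b, xd))
             - sum(ak * yk for ak, yk in zip(a, yd)))
        yd = [y] + yd[:-1]
        out.append(y)
    return out


def iir_df2(b, a, x):
    """Direct-Form II: 1/D(z) first, then N(z), sharing one delay line."""
    order = max(len(b) - 1, len(a))
    bb = list(b) + [0.0] * (order + 1 - len(b))
    aa = list(a) + [0.0] * (order - len(a))
    w = [0.0] * order        # w[k] = w(n-1-k), the only stored state
    out = []
    for sample in x:
        w0 = sample - sum(ak * wk for ak, wk in zip(aa, w))
        out.append(bb[0] * w0 + sum(bk * wk for bk, wk in zip(bb[1:], w)))
        w = [w0] + w[:-1]
    return out


b, a = [1.0, 0.5], [-0.5, 0.25]
impulse = [1.0] + [0.0] * 7
print(iir_df1(b, a, impulse))
print(iir_df2(b, a, impulse))
```

In floating point the two outputs agree to within rounding; Direct-Form II stores max(M, N) values instead of M + N.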


Flowgraph reversal gives the Transpose-Form IIR Filter Structure.

[Figure: Transpose-Form IIR filter structure]


Usually we design IIR filters with N = M, but not always. 


Obviously, since all these structures have identical frequency response, filter structures are not unique. We 
consider many different structures because 


1. Depending on the technology or application, one might be more convenient than another 
2. The response in a practical realization, in which the data and coefficients must be quantized, may differ 
substantially, and some structures behave much better than others with quantization. 


The Cascade-Form IIR filter structure is one of the least sensitive to quantization, which is why it is the most 
commonly used IIR filter structure. 


IIR Cascade Form 
The numerator and denominator polynomials can be factored

$$H(z) = \frac{b_0 + b_1 z^{-1} + \dots + b_M z^{-M}}{1 + a_1 z^{-1} + \dots + a_N z^{-N}} = \frac{b_0 \prod_{k=1}^{M} \left(1 - z_k z^{-1}\right)}{\prod_{i=1}^{N} \left(1 - p_i z^{-1}\right)}$$

and implemented as a cascade of short IIR filters.

Since the filter coefficients are usually real yet the roots are mostly complex, we actually implement these as
second-order sections, where complex-conjugate pole and zero pairs are combined into second-order sections
with real coefficients. The second-order sections are usually implemented with either the Direct-Form II or
Transpose-Form structure.


Parallel form 


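A rational transfer function can also be written as a partial-fraction expansion (assuming distinct poles $p_i$, with constants $c_k$ and $A_i$ obtained from the expansion)

$$H(z) = \frac{b_0 + b_1 z^{-1} + \dots + b_M z^{-M}}{1 + a_1 z^{-1} + \dots + a_N z^{-N}} = c_0 + c_1 z^{-1} + \dots + \frac{A_1}{z - p_1} + \frac{A_2}{z - p_2} + \dots + \frac{A_N}{z - p_N}$$

which by linearity can be implemented as

[Figure: parallel-form structure]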


As before, we combine complex-conjugate pole pairs into second-order sections with real coefficients. 


The cascade and parallel forms are of interest because they are much less sensitive to coefficient quantization 
than higher-order structures, as analyzed in later modules in this course. 


Other forms 


There are many other structures for IIR filters, such as wave digital filter structures, lattice-ladder, all-pass- 
based forms, and so forth. These are the result of extensive research to find structures which are 
computationally efficient and insensitive to quantization error. They all represent various tradeoffs; the best 
choice in a given context is not yet fully understood, and may never be. 


State-Variable Representation of Discrete-Time Systems


State and the State-Variable Representation 


State: the minimum additional information at time n, which, along with all current and future input values, is
necessary to compute all future outputs.


Essentially, the state of a system is the information held in the delay registers in a filter structure or signal 
flow graph. 


Note: Any LTI (linear, time-invariant) system of finite order M can be represented by a state-variable 
description 


$$x(n+1) = A x(n) + B u(n)$$
$$y(n) = C x(n) + D u(n)$$

where $x(n)$ is an $M \times 1$ "state vector," $u(n)$ is the input at time $n$, $y(n)$ is the output at time $n$; $A$ is an $M \times M$
matrix, $B$ is an $M \times 1$ vector, $C$ is a $1 \times M$ vector, and $D$ is a $1 \times 1$ scalar.


One can always obtain a state-variable description of a signal flow graph. 


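Example:
3rd-Order IIR

$$y(n) = -a_1 y(n-1) - a_2 y(n-2) - a_3 y(n-3) + b_0 x(n) + b_1 x(n-1) + b_2 x(n-2) + b_3 x(n-3)$$

Writing the input as $u(n)$ and taking the contents of the three Direct-Form II delay registers as the state (oldest value first), one state-variable description is

$$x(n+1) = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -a_3 & -a_2 & -a_1 \end{pmatrix} x(n) + \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} u(n)$$

$$y(n) = \begin{pmatrix} b_3 - a_3 b_0 & b_2 - a_2 b_0 & b_1 - a_1 b_0 \end{pmatrix} x(n) + b_0 u(n)$$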
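
As a hedged numerical check, the sketch below builds one companion-form state description of a 3rd-order filter and compares its output with the difference equation. The state assignment (delay-register contents of a Direct-Form II structure, oldest first), the helper names, and the example coefficients are assumptions of this sketch.

```python
def simulate_state_space(A, B, C, D, u):
    """Iterate x(n+1) = A x(n) + B u(n), y(n) = C x(n) + D u(n) from x(0) = 0."""
    x = [0.0] * len(B)
    y = []
    for un in u:
        y.append(sum(c * xi for c, xi in zip(C, x)) + D * un)
        x = [sum(aij * xj for aij, xj in zip(row, x)) + bi * un
             for row, bi in zip(A, B)]
    return y


def companion_form(b, a):
    """State matrices for b = [b0, b1, b2, b3], a = [a1, a2, a3]."""
    a1, a2, a3 = a
    b0, b1, b2, b3 = b
    A = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-a3, -a2, -a1]]
    B = [0.0, 0.0, 1.0]
    C = [b3 - a3 * b0, b2 - a2 * b0, b1 - a1 * b0]
    return A, B, C, b0


def difference_equation(b, a, u):
    """y(n) = -sum_k a_k y(n-k) + sum_k b_k u(n-k), zero initial conditions."""
    y = []
    for n in range(len(u)):
        acc = sum(bk * u[n - k] for k, bk in enumerate(b) if n - k >= 0)
        acc -= sum(ak * y[n - k] for k, ak in enumerate(a, start=1) if n - k >= 0)
        y.append(acc)
    return y


b, a = [1.0, 0.4, -0.3, 0.2], [-0.5, 0.25, -0.1]
u = [1.0] + [0.0] * 9
print(simulate_state_space(*companion_form(b, a), u))
print(difference_equation(b, a, u))
```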


Exercise: 


Problem: Is the state-variable description of a filter H(z) unique? 


Exercise: 


Problem: Does the state-variable description fully describe the signal flow graph? 


State-Variable Transformation


Suppose we wish to define a new set of state variables, related to the old set by a linear transformation:
$q(n) = T x(n)$, where $T$ is a nonsingular $M \times M$ matrix, and $q(n)$ is the new state vector. We wish the
overall system to remain the same. Note that $x(n) = T^{-1} q(n)$, and thus

$$x(n+1) = A x(n) + B u(n) \implies T^{-1} q(n+1) = A T^{-1} q(n) + B u(n) \implies q(n+1) = T A T^{-1} q(n) + T B u(n)$$
$$y(n) = C x(n) + D u(n) \implies y(n) = C T^{-1} q(n) + D u(n)$$

This defines a new state system with an input-output behavior identical to the old system, but with different
internal memory contents (states) and state matrices.

$$q(n+1) = \hat{A} q(n) + \hat{B} u(n)$$
$$y(n) = \hat{C} q(n) + \hat{D} u(n)$$
$$\hat{A} = T A T^{-1}, \quad \hat{B} = T B, \quad \hat{C} = C T^{-1}, \quad \hat{D} = D$$


These transformations can be used to generate a wide variety of alternative structures or implementations of a
filter. 


Transfer Function and the State-Variable Description
Taking the z transform of the state equations
$$Z[x(n+1)] = Z[A x(n) + B u(n)]$$
$$Z[y(n)] = Z[C x(n) + D u(n)]$$
$$z X(z) = A X(z) + B U(z)$$

Note: $X(z)$ is a vector of scalar z-transforms, $X(z)^T = (X_1(z) \;\; X_2(z) \;\; \dots)$

$$Y(z) = C X(z) + D U(z)$$

$$(zI - A) X(z) = B U(z) \implies X(z) = (zI - A)^{-1} B U(z)$$

so
$$Y(z) = C (zI - A)^{-1} B U(z) + D U(z) = \left(C (zI - A)^{-1} B + D\right) U(z)$$
and thus

$$H(z) = C (zI - A)^{-1} B + D$$

Note that since $(zI - A)^{-1} = \frac{\operatorname{adj}(zI - A)}{\det(zI - A)}$, this transfer function is an $M$th-order rational fraction in $z$.
The denominator polynomial is $D(z) = \det(zI - A)$. A discrete-time state system is thus stable if the $M$
roots of $\det(zI - A)$ (i.e., the poles of the digital filter) are all inside the unit circle.


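Consider the transformed state system with $\hat{A} = T A T^{-1}$, $\hat{B} = T B$, $\hat{C} = C T^{-1}$, $\hat{D} = D$:

$$\hat{H}(z) = \hat{C} (zI - \hat{A})^{-1} \hat{B} + \hat{D}$$
$$= C T^{-1} (zI - T A T^{-1})^{-1} T B + D$$
$$= C T^{-1} \left(T (zI - A) T^{-1}\right)^{-1} T B + D$$
$$= C T^{-1} (T^{-1})^{-1} (zI - A)^{-1} T^{-1} T B + D$$
$$= C (zI - A)^{-1} B + D$$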


This proves that state-variable transformation doesn't change the transfer function of the underlying system. 
However, it can provide alternate forms that are less sensitive to coefficient quantization or easier to analyze, 
understand, or implement. 


State-variable descriptions of systems are useful because they provide a fairly general tool for analyzing all 
systems; they provide a more detailed description of a signal flow graph than does the transfer function 
(although not a full description); and they suggest a large class of alternative implementations. They are even 
more useful in control theory, which is largely based on state descriptions of systems. 


Fixed-Point Number Representation 


Fixed-point arithmetic is generally used when hardware cost, speed, or 
complexity is important. Finite-precision quantization issues usually arise in 
fixed-point systems, so we concentrate on fixed-point quantization and 
error analysis in the remainder of this course. For basic signal processing 
computations such as digital filters and FFTs, the magnitude of the data, the 
internal states, and the output can usually be scaled to obtain good 
performance with a fixed-point implementation. 


Two's-Complement Integer Representation 


As far as the hardware is concerned, fixed-point number systems represent 
data as B-bit integers. The two's-complement number system is usually 
used: 


$$k = \begin{cases} \text{binary integer representation} & \text{if } 0 \le k \le 2^{B-1} - 1 \\ \text{bit-by-bit inverse}(-k) + 1 & \text{if } -2^{B-1} \le k \le 0 \end{cases}$$

The most significant bit is known as the sign bit; it is 0 when the number is
non-negative; 1 when the number is negative.


Fractional Fixed-Point Number Representation 


For the purposes of signal processing, we often regard the fixed-point
numbers as binary fractions in $[-1, 1)$, by implicitly placing a
binary point after the sign bit:

$$x = b_0 . b_1 b_2 b_3 \dots b_{B-1}$$

or

$$x = -b_0 + \sum_{i=1}^{B-1} b_i 2^{-i}$$


This interpretation makes it clearer how to implement digital filters in 
fixed-point, at least when the coefficients have a magnitude less than 1. 
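
The integer and fractional interpretations can be sketched in a few lines. The helper names below are illustrative, not part of any library.

```python
def to_twos_complement(k, B):
    """Return the B-bit two's-complement bit string of the integer k."""
    if not -(1 << (B - 1)) <= k <= (1 << (B - 1)) - 1:
        raise ValueError("k does not fit in B bits")
    return format(k & ((1 << B) - 1), "0{}b".format(B))


def from_twos_complement(bits):
    """Integer interpretation: the sign bit carries weight -2^(B-1)."""
    value = int(bits, 2)
    return value - (1 << len(bits)) if bits[0] == "1" else value


def fractional_value(bits):
    """Fractional interpretation: binary point placed after the sign bit."""
    return from_twos_complement(bits) / (1 << (len(bits) - 1))


print(to_twos_complement(-3, 4))   # 1101
print(fractional_value("1101"))    # -0.375
```

The same bit pattern therefore represents the integer -3 or the fraction -3/8, depending only on where the binary point is assumed to sit.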


Truncation Error 


Consider the multiplication of two binary fractions 


                 Fractional      Integer
                 Interpretation  Interpretation

     0.10           1/2             2
   x 0.11         x 3/4           x 3
   ------
      010
     010
    000
   ------
   0.0110           3/8             6


Note that full-precision multiplication almost doubles the number of bits; if 
we wish to return the product to a B-bit representation, we must truncate 
the B — 1 least significant bits. However, this introduces truncation error 
(also known as quantization error, or roundoff error if the number is 
rounded to the nearest B-bit fractional value rather than truncated). Note 
that this occurs after multiplication. 


Overflow Error 


Consider the addition of two binary fractions; 


                 Fractional      Integer
                 Interpretation  Interpretation

     0.10           1/2             2
   + 0.11         + 3/4           + 3
   ------
     1.01           5/4 -> -3/4     5 -> -3

Note the occurrence of wraparound overflow; this only happens with
addition. Obviously, it can be a bad problem.


There are thus two types of fixed-point error: roundoff error, associated 
with data quantization and multiplication, and overflow error, associated 
with data quantization and additions. In fixed-point systems, one must 
strike a balance between these two error sources; by scaling down the data, 
the occurrence of overflow errors is reduced, but the relative size of the
roundoff error is increased. 
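
Both error sources can be reproduced with integer arithmetic. In this sketch a B-bit fraction is stored as an integer in units of $2^{-(B-1)}$; the function names are assumptions made for illustration.

```python
def wrap(k, B):
    """Wrap an integer into the B-bit two's-complement range."""
    half = 1 << (B - 1)
    return (k + half) % (1 << B) - half


def fixed_mul_truncate(ka, kb, B):
    """Multiply two B-bit fractions and drop the B-1 extra least significant bits."""
    return (ka * kb) >> (B - 1)    # arithmetic shift truncates toward minus infinity


def fixed_add_wrap(ka, kb, B):
    """Add two B-bit fractions with wraparound overflow."""
    return wrap(ka + kb, B)


B = 3                               # 0.10 is stored as 2, 0.11 as 3
print(fixed_mul_truncate(2, 3, B))  # 1, i.e. 1/4 instead of the exact 3/8
print(fixed_add_wrap(2, 3, B))      # -3, i.e. -3/4 instead of 5/4
```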


Note: Since multiplies require a number of additions, they are especially
expensive in terms of hardware (with a complexity proportional to $B_x B_h$,
where $B_x$ is the number of bits in the data, and $B_h$ is the number of bits in
the filter coefficients). Designers try to minimize both $B_x$ and $B_h$, and
often choose $B_x \ne B_h$!


Fixed-Point Quantization 


The fractional $B$-bit two's complement number representation evenly
distributes $2^B$ quantization levels between $-1$ and $1 - 2^{-(B-1)}$. The
spacing between quantization levels is then

$$\frac{2}{2^B} = 2^{-(B-1)} \doteq \Delta_B$$

Any signal value falling between two levels is assigned to one of the two 
levels. 


$x_q = Q[x]$ is our notation for quantization. $e = Q[x] - x$ is then the
quantization error.


One method of quantization is rounding, which assigns the signal value to
the nearest level. The maximum error is thus $\frac{\Delta_B}{2} = 2^{-B}$.

[Figure: rounding quantizer characteristic $Q[x]$]

Another common scheme, which is often easier to implement in hardware,
is truncation. $Q[x]$ assigns $x$ to the next lowest level.

[Figure: truncation quantizer characteristic $Q[x]$]

The worst-case error with truncation is $\Delta_B = 2^{-(B-1)}$, which is twice as
large as with rounding. Also, the error is never positive, so on average it
has a non-zero mean (i.e., a bias component).
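
A minimal sketch of the two quantizers follows. The function names are illustrative, and overflow handling is deliberately omitted, so inputs are assumed to stay inside the representable range.

```python
import math


def quantize_round(x, B):
    """Round x to the nearest multiple of Delta_B = 2^-(B-1)."""
    delta = 2.0 ** -(B - 1)
    return math.floor(x / delta + 0.5) * delta


def quantize_truncate(x, B):
    """Two's-complement truncation: the next level at or below x."""
    delta = 2.0 ** -(B - 1)
    return math.floor(x / delta) * delta


B = 4                                # Delta_B = 0.125
for x in (0.35, -0.3):
    print(x, quantize_round(x, B), quantize_truncate(x, B))
```

For x = -0.3 truncation returns -0.375, so the error is negative and the magnitude of the quantized value grows.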


Overflow is the other problem. There are two common types: two's
complement (or wraparound) overflow, or saturation overflow.

[Figure: wraparound overflow characteristic $Q[x]$]

[Figure: saturation overflow characteristic $Q[x]$]

Obviously, overflow errors are bad because they are typically large; two's
complement (or wraparound) overflow introduces more error than
saturation, but is easier to implement in hardware. It also has the advantage
that if the sum of several numbers lies in $[-1, 1)$, the final answer will
be correct even if intermediate sums overflow! However, wraparound
overflow leaves IIR systems susceptible to zero-input large-scale limit
cycles, as discussed in another module. As usual, there are many tradeoffs
to evaluate, and no one right answer for all applications.
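
The tolerance of wraparound arithmetic to intermediate overflow can be seen in a small integer sketch (values are in units of $\Delta_B$; the function names are assumptions of this example):

```python
def wrap(k, B):
    """Two's-complement wraparound into the B-bit integer range."""
    half = 1 << (B - 1)
    return (k + half) % (1 << B) - half


def saturate(k, B):
    """Clip to the largest or smallest representable B-bit integer."""
    half = 1 << (B - 1)
    return max(-half, min(half - 1, k))


def accumulate(terms, B, overflow):
    """Add the terms one at a time, applying the overflow rule after each add."""
    total = 0
    for t in terms:
        total = overflow(total + t, B)
    return total


terms = [6, 5, -7]                      # B = 4 covers -8..7; the true sum is 4
print(accumulate(terms, 4, wrap))       # 4: correct despite the intermediate overflow
print(accumulate(terms, 4, saturate))   # 0: the clipped partial sum corrupts the result
```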


Finite-Precision Error Analysis 


Fundamental Assumptions in finite-precision error analysis 


Quantization is a highly nonlinear process and is very difficult to analyze 
precisely. Approximations and assumptions are made to make analysis 
tractable. 


Assumption #1 


The roundoff or truncation errors at any point in a system at each time are 
random, stationary, and statistically independent (white and independent 
of all other quantizers in a system). 


That is, the error autocorrelation function is $r_e[k] = E[e_n e_{n+k}] = \sigma_Q^2 \delta[k]$.
Intuitively, and confirmed experimentally in some (but not all!) cases, one
expects the quantization error to have a uniform distribution over the
interval $\left[-\frac{\Delta}{2}, \frac{\Delta}{2}\right)$ for rounding, or $(-\Delta, 0]$ for truncation.

In this case, rounding has zero mean and variance

$$E[Q[x_n] - x_n] = 0$$

$$\sigma_Q^2 = E[e_n^2] = \frac{\Delta_B^2}{12}$$

and truncation has the statistics

$$E[Q[x_n] - x_n] = -\frac{\Delta}{2}$$

$$\sigma_Q^2 = \frac{\Delta_B^2}{12}$$


Please note that the independence assumption may be very bad (for
example, when quantizing a sinusoid with an integer period $N$). There is
another quantizing scheme called dithering, in which the values are
randomly assigned to nearby quantization levels. This can be (and often is)
implemented by adding a small (one- or two-bit) random input to the signal
before a truncation or rounding quantizer.


[Figure: quantizer with a dither signal added at its input]

This is used extensively in practice. Although the overall error is somewhat
higher, it is spread evenly over all frequencies, rather than being
concentrated in spectral lines. This is very important when quantizing
sinusoidal or other periodic signals, for example.


Assumption #2 


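Pretend that the quantization error is really additive Gaussian noise with
the same mean and variance as the uniform quantizer. That is, model

[Figure: quantizer $Q[\cdot]$ mapping $x$ to $x_q$]

as

[Figure: adder forming $x_q = x + e$, with $e$ an additive noise input]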


This model is a linear system, which our standard theory can handle easily. 
We model the noise as Gaussian because it remains Gaussian after passing 
through filters, so analysis in a system context is tractable. 


Summary of Useful Statistical Facts 


• correlation function $r_x[k] = E[x_n x_{n+k}]$
• power spectral density $S_x(\omega) = \mathrm{DTFT}[r_x[n]]$
• Note $r_x[0] = \sigma_x^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_x(\omega)\, d\omega$
• $r_{xy}[k] = E[x^*[n]\, y[n+k]]$
• cross-spectral density $S_{xy}(\omega) = \mathrm{DTFT}[r_{xy}[n]]$
• For $y = h * x$:

$$S_{yy}(\omega) = |H(\omega)|^2 S_x(\omega)$$

• Note that the output noise level after filtering a noise sequence is

$$\sigma_y^2 = r_{yy}[0] = \frac{1}{2\pi} \int_{-\pi}^{\pi} |H(\omega)|^2 S_x(\omega)\, d\omega$$

so postfiltering quantization noise alters the noise power spectrum and
may change its variance!
• For $x_1$, $x_2$ statistically independent

$$r_{x_1 + x_2}[k] = r_{x_1}[k] + r_{x_2}[k]$$
$$S_{x_1 + x_2}(\omega) = S_{x_1}(\omega) + S_{x_2}(\omega)$$

• For independent random variables

$$\sigma_{x_1 + x_2}^2 = \sigma_{x_1}^2 + \sigma_{x_2}^2$$


Input Quantization Noise Analysis 


All practical analog-to-digital converters (A/D) must quantize the input 
data. This can be modeled as an ideal sampler followed by a B-bit 
quantizer. 


[Figure: ideal sampler followed by a B-bit quantizer]

The signal-to-noise ratio (SNR) of an A/D is

$$\mathrm{SNR} = 10 \log \frac{P_x}{P_n} = 10 \log P_x - 10 \log \frac{\Delta_B^2}{12} = 10 \log P_x + 4.77 + 6.02 B$$

where $P_x$ is the power in the signal and $P_n$ is the power of the quantization
noise, which equals its variance if it has a zero mean. The SNR increases by
6 dB with each additional bit.
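
The closed-form expression can be checked against a direct evaluation of $P_x / P_n$. This sketch assumes the uniform-error model with $P_n = \Delta_B^2 / 12$ and $\Delta_B = 2^{-(B-1)}$; the function name is illustrative.

```python
import math


def adc_snr_db(signal_power, B):
    """SNR in dB of a B-bit quantizer under the uniform-error model."""
    delta = 2.0 ** -(B - 1)
    noise_power = delta ** 2 / 12.0
    return 10.0 * math.log10(signal_power / noise_power)


for B in (8, 12, 16):
    print(B, round(adc_snr_db(1.0, B), 2), round(4.77 + 6.02 * B, 2))
```

The two columns agree to within the rounding of the constants 4.77 and 6.02.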


Quantization Error in FIR Filters 


In digital filters, both the data at various places in the filter, which are 
continually varying, and the coefficients, which are fixed, must be 
quantized. The effects of quantization on data and coefficients are quite 
different, so they are analyzed separately. 


Data Quantization 


Typically, the input and output in a digital filter are quantized by the analog- 
to-digital and digital-to-analog converters, respectively. Quantization also 
occurs at various points in a filter structure, usually after a multiply, since 
multiplies increase the number of bits. 


Direct-form Structures 


There are two common possibilities for quantization in a direct-form FIR 
filter structure: after each multiply, or only once at the end. 


Single-precision accumulate (quantize after each multiply); total variance $M \frac{\Delta^2}{12}$

Double-precision accumulate (quantize once at the end); variance $\frac{\Delta^2}{12}$

In the latter structure, a double-length accumulator adds all $2B - 1$ bits of
each product into the accumulating sum, and truncates only at the end.
Obviously, this is much preferred, and should always be used wherever
possible. All DSP microprocessors and most general-purpose computers
support double-precision accumulation.


Transpose-form 


Similarly, the transpose-form FIR filter structure presents two common 
options for quantization: after each multiply, or once at the end. 


Quantize at each stage before storing intermediate sum. Output
variance $M \frac{\Delta^2}{12}$

or

Store double-precision partial sums. Costs more memory, but variance
$\frac{\Delta^2}{12}$


The transpose form is not as convenient in terms of supporting double- 
precision accumulation, which is a significant disadvantage of this 
structure. 


Coefficient Quantization 


Since a quantized coefficient is fixed for all time, we treat it differently than 
data quantization. The fundamental question is: how much does the 
quantization affect the frequency response of the filter? 


The quantized filter frequency response is
$$\mathrm{DTFT}[h_Q] = \mathrm{DTFT}[h_{\text{inf. prec.}} + e] = H_{\text{inf. prec.}}(\omega) + H_e(\omega)$$

Assuming the quantization model is correct, $H_e(\omega)$ should be fairly
random and white, with the error spread fairly equally over all frequencies
$\omega \in [-\pi, \pi)$; however, the randomness of this error destroys any equiripple
property or any infinite-precision optimality of a filter.

Exercise: 


Problem: 


What quantization scheme minimizes the $L_2$ quantization error in
frequency (minimizes $\int_{-\pi}^{\pi} |H(\omega) - H_Q(\omega)|^2\, d\omega$)? On average,
how big is this error?


Ideally, if one knows the coefficients are to be quantized to $B$ bits, one
should incorporate this directly into the filter design problem, and find the
$M$ $B$-bit binary fractional coefficients minimizing the maximum deviation
($L_\infty$ error). This can be done, but it is an integer program, which is known
to be NP-hard (i.e., requires almost a brute-force search). This is so
expensive computationally that it's rarely done. There are some sub-optimal
methods that are much more efficient and usually produce pretty good
results.


Data Quantization in IIR Filters 


Finite-precision effects are much more of a concern with IIR filters than 
with FIR filters, since the effects are more difficult to analyze and 
minimize, coefficient quantization errors can cause the filters to become 
unstable, and disastrous things like large-scale limit cycles can occur. 


Roundoff noise analysis in IIR filters 


Suppose there are several quantization points in an IIR filter structure. By
our simplifying assumptions about quantization error and Parseval's
theorem, the quantization noise variance $\sigma_{y,i}^2$ at the output of the filter
from the $i$th quantizer is

$$\sigma_{y,i}^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} |H_i(\omega)|^2 S_{n_i}(\omega)\, d\omega = \frac{\sigma_{n_i}^2}{2\pi} \int_{-\pi}^{\pi} |H_i(\omega)|^2\, d\omega = \sigma_{n_i}^2 \sum_{n=-\infty}^{\infty} h_i^2(n)$$

where $\sigma_{n_i}^2$ is the variance of the quantization error at the $i$th quantizer,
$S_{n_i}(\omega)$ is the power spectral density of that quantization error, and
$H_i(\omega)$ is the transfer function from the $i$th quantizer to the output point.
Thus for $P$ independent quantizers in the structure, the total quantization
noise variance is

$$\sigma_y^2 = \frac{1}{2\pi} \sum_{i=1}^{P} \sigma_{n_i}^2 \int_{-\pi}^{\pi} |H_i(\omega)|^2\, d\omega$$


Note that in general, each $H_i(\omega)$, and thus the variance at the output due to
each quantizer, is different; for example, the system as seen by a quantizer
at the input to the first delay state in the Direct-Form II IIR filter structure
to the output, call it $n_4$, is

[Figure: signal path from noise source $n_4$ to the output]

with a transfer function

$$H_4(z) = \frac{z^{-2}}{1 + a_1 z^{-1} + a_2 z^{-2}}$$

which can be evaluated at $z = e^{j\omega}$ to obtain the frequency response.


A general approach to find $H_i(\omega)$ is to write state equations for the
equivalent structure as seen by $n_i$, and to determine the transfer function
according to $H(z) = C(zI - A)^{-1}B + D$.


Exercise: 


Problem: 


The above figure illustrates the quantization points in a typical 
implementation of a Direct-Form II IIR second-order section. What is 
the total variance of the output error due to all of the quantizers in the 
system? 


By making the assumption that each $Q_i$ represents a noise source that is
white, independent of the other sources, and additive,
the variance at the output is the sum of the variances at the output due to
each noise source:

$$\sigma_y^2 = \sum_{i=1}^{4} \sigma_{y,i}^2$$

The variance due to each noise source at the output can be determined from
$\frac{1}{2\pi} \int_{-\pi}^{\pi} |H_i(\omega)|^2 S_{n_i}(\omega)\, d\omega$; note that $S_{n_i}(\omega) = \sigma_{n_i}^2$ by our
assumptions, and $H_i(\omega)$ is the transfer function from the noise source to
the output.
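
Under the white-noise assumptions, the contribution of one quantizer is its variance times the energy of the impulse response from that quantizer to the output. The sketch below approximates that energy by truncating the sum over $h_i^2(n)$; the helper names, the truncation length, and the example coefficients are assumptions of this illustration.

```python
def impulse_response(b, a, length):
    """h(n) of (b0 + b1 z^-1 + ...)/(1 + a1 z^-1 + ...), with a = [a1, a2, ...]."""
    h = []
    for n in range(length):
        acc = b[n] if n < len(b) else 0.0
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * h[n - k]
        h.append(acc)
    return h


def noise_gain(b, a, length=2000):
    """Approximate sum_n h(n)^2, the factor multiplying the quantizer variance."""
    return sum(hn * hn for hn in impulse_response(b, a, length))


B = 16
sigma2 = (2.0 ** -(B - 1)) ** 2 / 12.0          # rounding-noise variance
gain = noise_gain([0.0, 0.0, 1.0], [-0.9, 0.5])  # z^-2 / (1 - 0.9 z^-1 + 0.5 z^-2)
print(gain, sigma2 * gain)
```

The pure delay in the numerator does not change the gain; only the denominator (the poles) does.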


IIR Coefficient Quantization Analysis 


Coefficient quantization is an important concern with IIR filters, since 
straightforward quantization often yields poor results, and because
quantization can produce unstable filters. 


Sensitivity analysis 


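The performance and stability of an IIR filter depend on the pole locations,
so it is important to know how quantization of the filter coefficients $a_k$
affects the pole locations $p_i$. The denominator polynomial is

$$D(z) = 1 + \sum_{k=1}^{N} a_k z^{-k} = \prod_{i=1}^{N} \left(1 - p_i z^{-1}\right)$$

We wish to know $\frac{\partial p_i}{\partial a_k}$, which, for small deviations, will tell us that a $\delta$
change in $a_k$ yields an $\epsilon = \delta \frac{\partial p_i}{\partial a_k}$ change in the pole location. $\frac{\partial p_i}{\partial a_k}$ is the
sensitivity of the pole location to quantization of $a_k$. We can find $\frac{\partial p_i}{\partial a_k}$ using
the chain rule.

$$\left.\frac{\partial D(z)}{\partial a_k}\right|_{z=p_i} = \left.\frac{\partial D(z)}{\partial p_i} \frac{\partial p_i}{\partial a_k}\right|_{z=p_i} \implies \frac{\partial p_i}{\partial a_k} = \left.\frac{\partial D(z) / \partial a_k}{\partial D(z) / \partial p_i}\right|_{z=p_i}$$

which is

$$\frac{\partial p_i}{\partial a_k} = \left.\frac{z^{-k}}{-z^{-1} \prod_{j=1, j \ne i}^{N} \left(1 - p_j z^{-1}\right)}\right|_{z=p_i} = \frac{-p_i^{N-k}}{\prod_{j=1, j \ne i}^{N} \left(p_i - p_j\right)}$$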


Note that as the poles get closer together, the sensitivity increases greatly. 
So as the filter order increases and more poles get stuffed closer together 
inside the unit circle, the error introduced by coefficient quantization in the 
pole locations grows rapidly. 
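
A quick numerical check of this behavior, assuming the closed form $\partial p_i / \partial a_k = -p_i^{N-k} / \prod_{j \ne i} (p_i - p_j)$ for distinct poles and using a second-order example with real poles (function names and pole values are illustrative):

```python
import math


def pole_sensitivity(poles, i, k):
    """d p_i / d a_k for D(z) = prod_j (1 - p_j z^-1); i is 0-based, k is 1-based."""
    N = len(poles)
    denom = 1.0
    for j, pj in enumerate(poles):
        if j != i:
            denom *= poles[i] - pj
    return -(poles[i] ** (N - k)) / denom


def larger_root(a1, a2):
    """Larger root of z^2 + a1 z + a2 (real roots assumed)."""
    return (-a1 + math.sqrt(a1 * a1 - 4.0 * a2)) / 2.0


p1, p2 = 0.9, 0.5
a1, a2 = -(p1 + p2), p1 * p2
eps = 1e-7
finite_diff = (larger_root(a1, a2 + eps) - larger_root(a1, a2)) / eps
print(pole_sensitivity([p1, p2], 0, 2), finite_diff)   # both close to -2.5
print(pole_sensitivity([0.9, 0.89], 0, 2))             # close poles: about -100
```

Moving the second pole from 0.5 to 0.89 raises the sensitivity by a factor of 40.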


How can we reduce this high sensitivity to IIR filter coefficient 
quantization? 


Solution 


Cascade or parallel form implementations! The numerator and denominator 
polynomials can be factored off-line at very high precision and grouped into 
second-order sections, which are then quantized section by section. The 
sensitivity of the quantization is thus that of second-order, rather than N-th 
order, polynomials. This yields major improvements in the frequency 
response of the overall filter, and is almost always done in practice. 


Note that the numerator polynomial faces the same sensitivity issues; the 
cascade form also improves the sensitivity of the zeros, because they are 
also factored into second-order terms. However, in the parallel form, the 
zeros are globally distributed across the sections, so they suffer from 
quantization of all the blocks. Thus the cascade form preserves zero 
locations much better than the parallel form, which typically means that the 
stopband behavior is better in the cascade form, so it is most often used in 
practice. 


Note: On the basis of the preceding analysis, it would seem important to
use cascade structures in FIR filter implementations. However, most FIR
filters are linear-phase and thus symmetric or anti-symmetric. As long as
the quantization is implemented such that the filter coefficients retain
symmetry, the filter retains linear phase. Furthermore, since all zeros off
the unit circle must appear in groups of four for symmetric linear-phase
filters, zero pairs can leave the unit circle only by joining with another pair.
This requires relatively severe quantizations (enough to completely remove
or change the sign of a ripple in the amplitude response). This "reluctance"
of zero pairs to leave the unit circle tends to keep quantization from
damaging the frequency response as much as might be expected, enough so
that cascade structures are rarely used for FIR filters.


Exercise: 


Problem: What is the worst-case pole pair in an IIR digital filter? 


Solution: 


The pole pair closest to the real axis in the z-plane, since the complex- 
conjugate poles will be closest together and thus have the highest 
sensitivity to quantization. 


Quantized Pole Locations 
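In a direct-form or transpose-form implementation of a second-order
section, the filter coefficients are quantized versions of the polynomial
coefficients.
$$D(z) = z^2 + a_1 z + a_2 = (z - p)(z - \overline{p})$$
$$p = \frac{-a_1 \pm \sqrt{a_1^2 - 4 a_2}}{2}$$
$$p = r e^{j\theta}$$
$$D(z) = z^2 - 2 r \cos(\theta) z + r^2$$

So

$$a_1 = -2 r \cos(\theta), \qquad a_2 = r^2$$

Thus the quantization of $a_1$ and $a_2$ to $B$ bits restricts the radius $r$ to
$r = \sqrt{k \Delta_B}$, and $a_1 = -2\Re(p) = k \Delta_B$, for integer $k$. The following figure shows all
stable pole locations after four-bit two's-complement quantization.

[Figure: stable pole locations after four-bit two's-complement quantization]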


Note the nonuniform distribution of possible pole locations. This might be
good for poles near $r = 1$, $\theta = \frac{\pi}{2}$, but not so good for poles near the origin
or the Nyquist frequency.


In the "normal-form" structures, a state-variable based realization, the poles
are uniformly spaced.

This can only be accomplished if the coefficients to be quantized equal the
real and imaginary parts of the pole location; that is,

$$\alpha_1 = r \cos(\theta) = \Re(p)$$
$$\alpha_2 = r \sin(\theta) = \Im(p)$$

This is the case for a 2nd-order system with the state matrix

$$A = \begin{pmatrix} \alpha_1 & \alpha_2 \\ -\alpha_2 & \alpha_1 \end{pmatrix}$$

$$\det(zI - A) = (z - \alpha_1)^2 + \alpha_2^2$$
$$= z^2 - 2\alpha_1 z + \alpha_1^2 + \alpha_2^2$$
$$= z^2 - 2 r \cos(\theta) z + r^2 \left(\cos^2(\theta) + \sin^2(\theta)\right)$$
$$= z^2 - 2 r \cos(\theta) z + r^2$$

Given any second-order filter coefficient set, we can write it as a state-space
system, find a transformation matrix $T$ such that $\hat{A} = T A T^{-1}$ is in normal
form, and then implement the second-order section using a structure
corresponding to the state equations.


The normal form has a number of other advantages; both eigenvalues have
equal magnitude, so it minimizes the norm of $Ax$, which makes overflow less likely,
and it minimizes the output variance due to quantization of the state values.
It is sometimes used when minimization of finite-precision effects is
critical.

Exercise: 


Problem: What is the disadvantage of the normal form? 


Solution: 


It requires more computation. The general state-variable equation 
requires nine multiplies, rather than the five used by the Direct-Form II 
or Transpose-Form structures. 


Limit Cycles 


Large-scale limit cycles 


When overflow occurs, even otherwise stable filters may get stuck in a 
large-scale limit cycle, which is a short-period, almost full-scale persistent 
filter output caused by overflow. 


Example:
Consider the second-order system

$$H(z) = \frac{1}{1 - z^{-1} + \frac{1}{2} z^{-2}}$$

with zero input and initial state values $x_0[0] = 0.8$, $x_1[0] = -0.8$. Note
$y[n] = x_0[n+1]$.
The filter is obviously stable, since the magnitude of the poles is
$\frac{1}{\sqrt{2}} = 0.707$, which is well inside the unit circle. However, with
wraparound overflow, note that $y[0] = x_0[1] = \frac{4}{5} - \frac{1}{2}\left(-\frac{4}{5}\right) = \frac{6}{5}$, which wraps to $-\frac{4}{5}$,
and that $x_0[2] = y[1] = -\frac{4}{5} - \frac{1}{2}\left(\frac{4}{5}\right) = -\frac{6}{5}$, which wraps to $\frac{4}{5}$, so
$y[n] = -\frac{4}{5}, \frac{4}{5}, -\frac{4}{5}, \dots$ even with zero input.

Clearly, such behavior is intolerable and must be prevented. Saturation
arithmetic has been proved to prevent zero-input limit cycles, which is one
reason why all DSP microprocessors support this feature. In many
applications, this is considered sufficient protection. Scaling to prevent
overflow is another solution, provided the initial state values are never
initialized to limit-cycle-producing values. The normal-form structure also
reduces the chance of overflow.
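
The large-scale limit cycle can be reproduced exactly with rational arithmetic. This sketch assumes the recursion $y[n] = y[n-1] - \frac{1}{2} y[n-2]$ with zero input and the initial values 0.8 and -0.8 from the example; the function names are illustrative, and the saturation rule clips to $[-1, 1]$ for simplicity rather than to $1 - \Delta_B$.

```python
from fractions import Fraction


def wrap(v):
    """Two's-complement wraparound into [-1, 1)."""
    return (v + 1) % 2 - 1


def saturate(v):
    """Clip into [-1, 1] (a simplification of hardware saturation)."""
    return max(Fraction(-1), min(Fraction(1), v))


def zero_input_response(overflow, y1, y2, steps):
    """Iterate y[n] = y[n-1] - y[n-2]/2 with y1 = y[-1], y2 = y[-2]."""
    out = []
    for _ in range(steps):
        y = overflow(y1 - y2 / 2)
        out.append(y)
        y1, y2 = y, y1
    return out


start = (Fraction(4, 5), Fraction(-4, 5))
print([str(v) for v in zero_input_response(wrap, *start, 6)])
print([float(v) for v in zero_input_response(saturate, *start, 6)])
```

With wraparound the output alternates between -4/5 and 4/5 indefinitely; with saturation the first output clips and the response then decays.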


Small-scale limit cycles 


Small-scale limit cycles are caused by quantization. Consider the system

[Figure: first-order recursive filter with feedback coefficient $\alpha$ and a rounding quantizer]

Note that when $\alpha z_0 > z_0 - \frac{\Delta_B}{2}$, rounding will quantize the output to the
current level (with zero input), so the output will remain at this level
forever. Note that the maximum amplitude of this "small-scale limit cycle"
is achieved when

$$\alpha z_0 = z_0 - \frac{\Delta_B}{2} \implies z_{\max} = \frac{\Delta_B}{2(1 - \alpha)}$$

In a higher-order system, the small-scale limit cycles are oscillatory in
nature. Any quantization scheme that never increases the magnitude of any
quantized value prevents small-scale limit cycles.


Note: Two's-complement truncation does not do this; it increases the
magnitude of negative numbers.

However, magnitude-reducing quantization introduces greater error and bias. Since the level of the limit
cycles is proportional to $\Delta_B$, they can be reduced by increasing the number
of bits. Poles close to the unit circle increase the magnitude and likelihood
of small-scale limit cycles.
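
A hedged sketch of these effects, assuming a first-order zero-input recursion $y \leftarrow Q[\alpha y]$ with $\alpha = 0.9$ and amplitudes measured in units of $\Delta_B$ (the system, names, and values are assumptions of this example):

```python
import math
from fractions import Fraction


def run(quantize, alpha, level, steps=50):
    """Iterate level <- Q[alpha * level] and return the final level."""
    for _ in range(steps):
        level = quantize(alpha * level)
    return level


def round_nearest(v):
    return math.floor(v + Fraction(1, 2))


truncate_twos = math.floor         # two's-complement truncation, toward minus infinity
truncate_magnitude = math.trunc    # magnitude truncation, toward zero

alpha = Fraction(9, 10)
print(run(round_nearest, alpha, 10))        # sticks at 5 = 1 / (2 (1 - alpha))
print(run(truncate_twos, alpha, -10))       # sticks at -9
print(run(truncate_magnitude, alpha, -10))  # decays to 0
```

Only the magnitude-truncation quantizer, which never increases the magnitude of a value, lets the output decay to zero from both positive and negative starting levels.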


Scaling 


Overflow is clearly a serious problem, since the errors it introduces are very 
large. As we shall see, it is also responsible for large-scale limit cycles, which 
cannot be tolerated. One way to prevent overflow, or to render it acceptably 
unlikely, is to scale the input to a filter such that overflow cannot (or is 
sufficiently unlikely to) occur. 


In a fixed-point system, the range of the input signal is limited by the
fractional fixed-point number representation to $|x[n]| \le 1$. If we scale the
input by multiplying it by a value $\beta$, $0 < \beta < 1$, then $|\beta x[n]| \le \beta$.

Another option is to incorporate the scaling directly into the filter
coefficients.

FIR Filter Scaling


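What value of $\beta$ is required so that the output of an FIR filter cannot overflow
($\forall n: |y(n)| \le 1$ whenever $\forall n: |x(n)| \le 1$)?

$$|y(n)| = \left|\sum_{k=0}^{M-1} h(k)\, \beta\, x(n-k)\right| \le \sum_{k=0}^{M-1} |h(k)|\, |\beta|\, |x(n-k)| \le \beta \sum_{k=0}^{M-1} |h(k)|$$

so

$$\beta < \frac{1}{\sum_{k=0}^{M-1} |h(k)|}$$

Alternatively, we can incorporate the scaling directly into the filter, and
require that

$$\sum_{k=0}^{M-1} |h(k)| < 1$$

to prevent overflow.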
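
The bound is tight: an input whose samples match the signs of the coefficients drives the output to the sum of the coefficient magnitudes. A short sketch (function names and coefficient values are illustrative):

```python
def l1_scale(h):
    """Largest input scale factor for which the output magnitude cannot exceed 1."""
    return 1.0 / sum(abs(hk) for hk in h)


def worst_case_output(h):
    """Output when x(n-k) = sign(h(k)), which equals sum_k |h(k)|."""
    return sum(hk * (1.0 if hk >= 0 else -1.0) for hk in h)


h = [0.5, -1.0, 0.75, 0.25]
beta = l1_scale(h)
print(beta, worst_case_output(h), beta * worst_case_output(h))   # 0.4 2.5 1.0
```

In practice one would pick a value slightly below this limit, or accept rare overflow, as the discussion of less conservative scaling rules suggests.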


IIR Filter Scaling 


To prevent the output from overflowing in an IIR filter, the condition above
still holds (with $M = \infty$):

$$|y(n)| \le \sum_{k=0}^{\infty} |h(k)|$$

so an initial scaling factor $\beta < \frac{1}{\sum_{k=0}^{\infty} |h(k)|}$ can be used, or the filter itself can
be scaled.

However, it is also necessary to prevent the states from overflowing, and to
prevent overflow at any point in the signal flow graph where the arithmetic
hardware would thereby produce errors. To prevent the states from
overflowing, we determine the transfer function from the input to all states $i$,
and scale the filter such that $\forall i: \sum_{k=0}^{\infty} |h_i(k)| \le 1$.


Although this method of scaling guarantees no overflows, it is often too
conservative. Note that a worst-case signal is $x(n) = \operatorname{sign}(h(-n))$; this
input may be extremely unlikely. In the relatively common situation in which
the input is expected to be mainly a single-frequency sinusoid of unknown
frequency and amplitude less than 1, a scaling condition of

$$\forall \omega: |H(\omega)| \le 1$$

is sufficient to guarantee no overflow. This scaling condition is often used. If
there are several potential overflow locations $i$ in the digital filter structure,
the scaling conditions are

$$\forall i, \omega: |H_i(\omega)| \le 1$$

where $H_i(\omega)$ is the frequency response from the input to location $i$ in the
filter.


Even this condition may be excessively conservative, for example if the input 
is more-or-less random, or if occasional overflow can be tolerated. In 
practice, experimentation and simulation are often the best ways to optimize 
the scaling factors in a given application. 


For filters implemented in the cascade form, rather than scaling for the entire 
filter at the beginning, (which introduces lots of quantization of the input) the 
filter is usually scaled so that each stage is just prevented from overflowing. 
This is best in terms of reducing the quantization noise. The scaling factors 
are incorporated either into the previous or the next stage, whichever is most 
convenient. 


Some heuristic rules for grouping poles and zeros in a cascade
implementation are:


1. Order the poles in terms of decreasing radius. Take the pole pair closest
to the unit circle and group it with the zero pair closest to that pole pair
(to minimize the gain in that section). Keep doing this with all remaining
poles and zeros.

2. Order the sections so that those with the highest gain ($\max_\omega |H_i(\omega)|$) are in the
middle, and those with lower gain are on the ends.


Leland B. Jackson has an excellent intuitive discussion of finite-precision 
problems in digital filters. The book by Roberts and Mullis is one of the most 
thorough in terms of detail. 


